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Abstract. To precondition a large and sparse linear system, two direct methods for approximate 
factoring of the inverse are devised. The algorithms are fully parallelizable and appear to be more 
robust than the iterative methods suggested for the task. A method to compute one of the matrix 
subspaccs optimally is derived. Possessing a considerable amount of flexibility, these approaches 
extend the approximate inverse preconditioning techniques in several natural ways. Numerical ex- 
periments are given to illustrate the performance of the prcconditioners on a number of challenging 
^jrj benchmark linear systems. 
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1. Introduction. Approximate factoring of the inverse means parallelizable al- 
gebraic techniques for preconditioning a linear system involving a large and sparse 
nonsingular matrix A € C" x ". The idea is to multiply A by a matrix W from the 
right (or left) with the aim at having a matrix AW which can be approximated with 
an easily invertible matrix^ As opposed to the usual paradigm of preconditioning, 
iterations are not expected to converge rapidly for AW. Instead, the task can be 
interpreted as that of solving 

inf \\AWV~ 1 - l\\ w (1.1) 
W£W,vev n '* 

approximately by linearizing the problem appropriately |141 3] . Here W and V are 
nonsingular sparse standard matrix subspaces of C nx " with the property that that 
the nonsingular elements of V are assumed to allow a rapid application of the inverse. 
Approximate solutions to this problem can be generated with the power method as 
suggested in [4]. In this paper, direct methods are devised for approximate factoring 
based on solving 

min IIAW-FIL (1.2) 

when the columns of either W or V being constrained to be of fixed norm. These two 
approaches allow, once the matrix subspace W has been fixed, choosing the matrix 
subspace V in an optimal way. 

The first algorithm solves (|1.2p when the columns of V are constrained to be of 
fixed norm. Then the matrix subspaces AW and V are compared as such while other 
properties of A are largely overlooked. The second algorithm solves the problem when 
the columns of W are constrained to be of fixed norm, allowing taking properties of 
A more into account. In [4] the approach to this end was based on approximating the 
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smallest singular value of the map 

W\ — >{I-P V )AW (1.3) 

from W to C nx ™ with the power iteration. Here Py denotes the orthogonal projector 
on C" x " onto V. The second algorithm devised in this paper is a direct method for 
solving the task. 

The algorithms proposed extend the standard approximate inverse computational 
techniques in several ways. (For sparse approximate inverse computations, see [TJ 
Section 5], [12] and [16l Chapter 10.5] and references therein.) Aside from possessing 
an abundance of degrees of freedom, we have an increased amount of optimality if 
we suppose the matrix subspace W to be given. Then computable conditions can be 
formulated for optimally choosing the matrix subspace V. This is achieved without 
any significant increase in the computational cost. In particular, only a columnwise 
access to the entries of A is required @ 

We aim at maximal parallelizability by solving the minimization problem (|1.2|) 
columnwise. The cost of such a high parallelism is the need to have a mechanism 
to somehow control the conditioning of the factors. After all, parallelism means 
performing computations locally and independently. Also this can be achieved without 
any significant increase in the computational cost. 

Although the choice of the matrix subspace W is apparently less straightforward, 
some ideas are suggested to this end. Here we cannot claim achieving optimality, 
except that once done, thereafter V can be generated in an optimal way. In particular, 
because there are so many alternatives to generate matrix subspaces, many ideas 
outlined in this paper are certainly not fully developed and need to be investigated 
more thoroughly. 

The paper is organized as follows. In Section [5] two algorithms are devised for 
approximate factoring of the inverse. Section [3] is concerned with ways to choose the 
matrix subspace V optimally. Related stabilization schemes are suggested. In Section 
H heuristic Al schemes are suggested for constructing the matrix subspace W. In 
Section 5 numerical experiments are conducted. The toughest benchmark problems 
from [3] are used in the tests. 

2. Direct approximate factoring of the inverse. In what follows, two algo- 
rithms are devised for computing matrices W and V to have an approximate factor- 
ization 

A' 1 « WV- 1 (2.1) 

of the inverse of a given sparse nonsingular matrix A <E C" x " . The factors W and V 
are assumed to belong to given sparse standard matrix subspaces W and V of C nxn . 
A matrix subspace is said to be standard if it has a basis consisting of standard ba- 
sis matrices^ This allows maximal parallelizability by the fact that then the arising 
computational problems can be solved columnwise independently. Of course, paral- 
lelizability is imperative to fully exploit the processing power of modern computing 
architectures. 



2 Accessing the entries of the adjoint can be costly in parallel computations. 

3 Analogously to the standard basis vectors of C n , a standard basis matrix of C nxn has exactly 
one entry equaling one while its other entries are zeros. 
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2.1. First basic algorithm. Consider the minimization problem (|1.2p under 
the assumption that the columns of V are constrained to be unit vectors, i.e., of norm 
one. Based on the sparsity structure of W and the corresponding columns of A, the 
aim is at first choosing V optimally. Thereafter W is determined optimally. 

To describe the method, denote by Wj and Vj the jth columns of W and V . The 
column Vj is computed first as follows. Assume there can appear kj <C n nonzero 
entries in Wj at prescribed positions and denote by Aj £ C nx 3 ' the matrix with the 
corresponding columns of A extracted. Compute the sparse QR factorization 

Aj = QjRj (2.2) 

of Aj. (Recall that the sparse QR- factorization is also needed in sparse approximate 
inverse computations.) Assume there can appear lj <C n nonzero entries in Vj at 
prescribed positions and denote by Mj £ C a a the matrix with the corresponding 
columns of Q* extracted. Then Vj, regarded as a vector in C lj , of unit norm is 
computed satisfying 

\\MjVj\\ = \\Mj\\, (2-3) 

i.e., Vj is chosen in such a way that its component in the column space of Aj is as 
large as possible. This can be found by computing the singular value decomposition 
of Mj . (Its computational cost is completely marginal by the fact that Mj is only a 
kj-by-lj matrix.) 

Suppose the column Vj has been computed as just described for j = l,...,n. 
Then solve the least squares problems 

min 1 1 Aj Wj — Vj\\ 2 (2-4) 

to have the column Wj of W . 

For each pair Vj and Wj of columns, the computational cost consists of computing 
the sparse QR factorization (|2.2p and, by using it, solving (|2.3|) and (|2.4p . For the 
sparse QR factorization there are codes available [5] . (Now Aj has the special property 
of being very "tall and skinny".) 

The constraint of requiring the columns of V to be unit vectors is actually not 
a genuine constraint. That is, the method is scaling invariant from the right and 
thereby any nonzero constraints are acceptable in the sense that the condition (|2.3I) 
could equally well be replaced with ||MjVj|| = r 3 -||Mj||. Let us formulate this as 
follows. 

Theorem 2.1. Assume A £ C" xn is nonsingular. If V and W are standard 
matrix subspaces of <C nxn , then the factorization (|2.ip computed as described is in- 
dependent of the fixed column constraints \\vj\\ 2 = Tj > for j = 1, . . .,n. 

Proof. Let W and V be the matrices computed with the unit norm constraint for 
the columns of V. Let W and V be computed with other strict positivity constraints 
for the columns of V, i.e., (|2.3p is replaced with the condition 

11^-11= rj\\Mj\\. (2.5) 

Then we have V — VD and W = WD for a diagonal matrix D with nonzero entries. 
Consequently, WV~ l = WV~ X whenever the factors are invertible. □ 
Corollary 2.2. If a matrix V solving 

min IL4W-VIL. (2.6) 

w&v, vev, \\v\\ F =i * 
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is nonsingular, then the factorization (|2.1[) coincides with the one computed to satisfy 
(E3D and (gg)) . 

Proof. Suppose W and V solve (|2.6|1 . Since V is invertible, we have ||t>j|| 2 = rj > 

0. Using these constraints, compute W and V to satisfy (|2.5[) and (|2.4|) . This means 
solving (|2.6p columnwise and thereby the corresponding factorizations coincide. □ 

It is instructive to see how the computation of an approximate inverse relates 
with this. (For sparse approximate inverses and their historical development, see [TJ 
Section 5].) 

Example 2.1. In approximate inverse computations, the matrix subspace V is as 
simple as possible, i.e., the set of diagonal matrices. Regarding the constraints, the 
columns are constrained to be unit vectors. Therefore one can replace V with the 
identity matrix, as is customary. See also Example 13.21 below. 

2.2. Second basic algorithm. Consider the minimization problem (|1.2[) under 
the assumption that the columns of W are constrained to be unit vectors instead. 
Based on the sparsity structure of W and the corresponding columns of A, the aim now 
is at first choosing W optimally. Thereafter V is determined optimally. The resulting 
scheme yields a direct analogue of the power method suggested in [4j. However, the 
method proposed here has at least three advantages. First, being direct, it seems to 
be more robust since there is no need to tune parameters used in the power method. 
Second, the Hermitian transpose of A is not needed. Third, the computational cost is 
readily predictable by the fact that, in essence, we only need to compute sparse QR 
factorizations. 

To describe the method, denote by Wj and Vj the jth columns of W and V. The 
column Wj is computed first as follows. Assume there can appear kj -C n nonzero 
entries in Wj at prescribed positions and denote by Aj 6 c™ xfc j the matrix with the 
corresponding columns of A extracted. Assume there can appear lj <C n nonzero 
nonzero entries in Vj at prescribed positions and denote by Aj g ^(n-ij)xkj 
matrix with the corresponding rows of Aj removed. Then take Wj to be a right 
singular vector corresponding to the smallest singular value of Aj . 

To have Wj inexpensively, compute the sparse QR factorization 

= QjRj 

of Aj . Then compute the singular value decomposition of Rj . Of course, its compu- 
tational cost is completely negligible. (However, do not form the arising product to 
have the SVD of Aj explicitly.) Then take Wj from the singular value decomposition 
ofRj. 

Suppose the column Wj has been computed as just described for j = 1, . . . ,n. 
Then, to have the columns of V, set 

V = P v A[ Wl ---w n ], 

1. e., nonzero entries are accepted only in the allowed sparsity structure of Vj. 

For an analogue of Corollary 12.21 assume a matrix corresponding to the smallest 
singular value of the linear map (|1.3[) is nonsingular. Since W is a standard matrix 
subspace, the computations can be performed columnwise. The resulting W can be 
chosen to coincide, once divided by y/n, with this matrix. 
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2.3. Some general remarks. In approximate inverse preconditioning, it is well- 
known that it can make a difference whether one computes a right or left approximate 
inverse Q~| pp. 449-450]. As we have generalized this technique, this is the case 
with the approximate factoring of the inverse also. Here we have considered only 
preconditioning from the right. 

The usage of standard matrix subspaces leads to maximal parallelizability. In 
view of approximating the inverse, this means that computations are done locally 
(columnwise) and independently, i.e., without any global control. To compensate for 
this, with an eye to improve the conditioning of the factors, it seems advisable to 
impose additional constraints. This is considered in Section [3j 

The simultaneous (somehow optimal) choice of the matrix subspaces W and V 
is a delicate matter. In [3] we gave a rule thumb according to which the sparsity 
structures of the matrix subspaces should differ as much as possible in approximate 
factoring of the inverse. (This automatically holds in computing approximate inverses 
and ILU factorizations.) Numerical experiments seem to support this. Although we 
do not quite understand the reasons for this, it is partially related with the fact that 
then there are very few redundancies in the factorizations (|2.1j) as follows. 

Proposition 2.3. Let V and W be standard nonsingular matrix subspaces of 
C nx ™ containing the identity. If in the complement of the diagonal matrices the 
intersection of V and W is empty, then the maximum rank of the map 

(V,W)i — >WV- 1 (2.7) 

onVxWfl GL(n, C) is dim V + dim W - n. 

Proof. Linearize the map (|2.7p at (V, W) for both V and W invertible. Using the 
Neumann series yields the linear term 

W{W~ X W - V~ X V)V~ X . 

At (V, W) = (I, I) the rank is dim V + dimW — n. It is the maximum by the fact 
that for any nonsingular diagonal matrix D we have (VD, WD) i — > WV~ 1 , i.e., the 
map (|2.7|) can be regarded as a function of dimV + dimW — n variables. □ 

Aside from this basic principle, more refined techniques are devised for simulta- 
neously choosing the matrix subspaces W and V in the sections that follow. Most 
notably, optimal ways of choosing V are devised. 

3. Optimal construction of the matrix subspace V and imposing con- 
straints. For the basic algorithms introduced, a method for optimally choosing the 
matrix subspace V is devised under the assumption that the matrix subspace W has 
been given. Moreover, mechanisms are introduced into the basic algorithms that allow 
stabilizing the scheme for better conditioned factors. (In approximate inverse precon- 
ditioning the latter task is accomplished in the simplest possible way: the subspace 
V is simply CI, i.e., scalar multiples of the identity.) 

3.1. Optimally constructing the matrix subspace V. Suppose the matrix 
subspace W has been given. Then the condition (|2.3p yields a columnwise criterion 
for optimally choosing the sparsity structure of the matrix subspace V. (Recall that 
it must be assumed that the nonsingular elements of V allow a rapid application of 
the inverse.) Once done, proceed by using one of the basic algorithms to compute the 
factors. 

Consider (|2.3p . It is beneficial to choose the sparsity structure of Vj in such a way 
that the norm of Mj is as large as possible, with the constraint that in the resulting 
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V the nonsingular elements are readily invertible. In other words, among admissible 
columns of Q* , take lj columns which yields Mj with the maximal norm. This means 
that for the optimization problem (|1.2[) . with a fixed matrix subspace W, the matrix 
subspace V is constructed in an optimal way. 

Certainly, the problem of choosing lj columns to maximize the norm is combina- 
torial and thereby rapidly finding a solution does not appear to be straightforward. A 
suboptimal choice for the matrix Mj can be readily generated by taking lj admissible 
columns of Q* with largest norms. When done with respect to the Euclidean norm, 
the Frobenius norm of the submatrix is maximized instead. This can be argued, of 
course, by the fact that 

=pr= = pr \\Mj\\ F < \\Mj\\ < \\Mj\\ F 
y max\Kj , t;} 

holds. 

This approach starts with W and then yields V (sub)optimally. This process can 
be used to assess how W was initially chosen. Let us illustrate this with the following 
example. 

Example 3.1. The choice of upper (lower) triangular matrices for V has the advan- 
tage that then we have a warning signal in case W is poorly chosen. Namely, suppose 

V has been (sub)optimally constructed as just described. If the factor V computed to 
satisfy (|1.2|) is poorly conditioned, one should consider updating the sparsity structure 
of VV to have a matrix subspace which better suited for approximate factoring of the 
inverse of v4@ 

In this optimization scheme, let us illustrate how the matrix subspace W actually 
could be poorly chosen. Namely, the way the above optimization scheme is set up 
means that the sparsity structure of W should be such that no two columns share the 
same sparsity structure. (Otherwise V will have equaling columns.) Of course, this 
may be too restrictive. In the section that follows, a way to circumvent this problem 
is devised by stabilization. 

3.2. Optimizing under additional constraints. There are instances which 
require imposing additional constraints in computing the factors. Aside from the 
problems described above, in tough problems the approximate factors may be poorly 
conditioned of even singular Because there holds 

llAJW- 1 -III 

" " < UW - V|| < \\AWV- 1 - I\\ \\V\\, (3.1) 

this certainly cannot be overlooked. To overcome this, it is advisable to stabilize the 
computations by appropriately modifying the optimality conditions in computing the 
factors. 

For the first basic algorithm this means a refined computation of V. Thereafter 
the factor W is computed columnwise as before to satisfy the conditions (|2.4p . For 
a case in which the conditioning is readily controlled, consider a matrix subspace V 



4 This is actually the case in the (numerically) exact factoring: To recover whether a matrix 
A a C nxn is nonsingular, it is advisable to compute its partially pivoted LU factorization, i.e., use 
a numerically reliable algorithm. 

5 This is a well-known phenomenon in preconditioning. For ILU factorization there are many 
ways to stabilize the computations pp. Stabilization has turned out to be indispensable in practice. 
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belonging to the set of upper (or lower) triangular matrices. Then, suppose the jth 
column vj computed to satisfy (I2.3[) results in a tiny jth component. To stabilize 
the computations for the hrst basic algorithm, we replace Vj by first imposing the jth 
component of Vj to equal a constant rj > 0. For the remaining components, let Mj be 
a submatrix consisting of the lj — 1 largest columns of Q* among its first j— 1 columns. 
Denote the jth column of Q* by pj . Then consider the optimization problem 



max 

|fi 3 -||2=l 



r jPj + MjVj 



(3.2) 



By invoking the singular value decomposition Mj — UjYjjV* of Mj, this is equivalent 
to solving 



max 

\v, I 2 = 1 



jPj + ^jVj 



(3.3) 



where pj = UjPj and Vj — V*Vj. Consequently, choose Vj = (e l9 , 0, 0, . . . , 0), where 
9 is the argument of the first component of pj. (If the first component is zero, then 
any 9 will do.) Set the column Vj to be the sum of Tj&j and the vector obtained after 
putting the entries of VjVj at the positions where the corresponding lj — 1 largest 
columns of Q* appeared. (Here ej denotes the jth standard basis vector of C n .) 

Observe that the solution does not depend on the value of rj > 0. In particular, 
it is not clear how large Tj should be. 

Again it is instructive to contrast this with the approximate inverse computations. 

Example 3.2. The sparse approximate inverse computations yield the simplest 
case of imposing additional constraints as just described. That is, the sparse approx- 
imate inverse computations can be interpreted as having lj — 1 for every column, 
combined with imposing rj = 1. 

The LU factorization and thereby triangular matrices are extensively used in 
preconditioning. Because the LU factorization without pivoting is unstable, some kind 
of stabilization is needed. It is clear that the QR factorization also gives reasons to look 
at triangular matrices. The approach differs from that of using the LU factorization 
in that its computation does not require a stabilization, i.e., nothing like partial 
pivoting is needed. Of course, our intention is not to propose computing the full QR 
factorization. Understanding the Q factor is critical as follows. 

Example 3.3. The QR factorization A* = QR of the Hermitian transpose of A can 
be used as a starting point to construct matrix subspaces for approximate factoring 
of the inverse. Namely, we have AQ — R* . Therefore V belonging to the set of lower 
triangular matrices is a natural choice. For W one needs to generate an approximation 
to the sparsity structure of Q. For this there are many alternatives. 

Aside from upper (lower) triangular matrices, the are, of course, completely dif- 
ferent alternatives. Consider, for example, choosing V among diagonally dominant 
matrices. Since the set of diagonally dominant matrices is not a matrix subspace, 
dealing with this structure requires using constraints. It is easy to see that the prob- 
lem can be tackled completely analogously, by imposing imposing r > 1 to hold for 
every diagonal entry. Thereafter (|3.2|) solved for having the other components in 
the column. The inversion of V can be performed by simple algorithms such as the 
Gauss-Seidel method. 



8 



M. BYCKLING AND M. HUHTANEN 



4. Constructing the matrix subspace W. Optimally constructing the matrix 
subspace W for approximate factoring of the inverse appears seemingly challenging. 
Some ideas are suggested in what follows, although no claims concerning the optimal- 
ity are made. We suggest starting the process by taking an initial standard matrix 
subspace Vo which precedes the actual V. Once W has been as constructed, then Vo 
should be replaced with V computed with the techniques introduced in Section [3J 

4.1. The Neumann series constructions. For approximate inverse compu- 
tations the selection of an a-priori sparsity pattern is a well-known problem [7J [2] . 
Good sparsity patterns are, at least in some cases, related to the transitive closures 
of subsets of the connectivity graph of G(A) of A. This can also be interpreted as 
computing level set expansions on the vertices of a sparsified G(A). 

In [7] numerical dropping is used to sparsify G(A) or its level set expansions. 
Denote by v € C" a vector with entries Vj. To select the relatively large entries of v 
numerically, entries are dropped by relative tolerance r and by count p, i.e., only those 
entries of v that are relatively large with the restriction of p largest entries at most are 
stored. (Note that the diagonal elements are not subjected to numerical dropping.) 
In what follows, these rules are referred to as numerical dropping by tolerance and 
count. 

The dropping can be performed on an initial matrix or during the intermediate 
phases of the level set expansion. Thus we have two sets of parameters (n,pi) con- 
trolling the initial sparsihcation and (r^pi) controlling the sparsihcation during level 
set expansion. In addition, we adopt the convention that setting any parameter as 
zero implies that the dropping parameter is not used. 

With these preparations for approximate factoring of the inverse, take an ini- 
tial standard matrix subspace Vo and consider generating a sparsity pattern for W. 
Assuming Vq = Pv A € Vo is invertible, we have 



Whenever < 1, there holds A' 1 = (I + E^Li S j ) V^ 1 = WVq 1 by invoking the 
Neumann series. Therefore then 



Although the assumption ||5|| < 1 is generally too strict in practice, we may formally 
truncate the series (|4.ip to generate a sparsity pattern. To make this economical and 
to retain W sparse enough, compute powers of S only approximately by using sparse- 
sparse operations combined with numerical dropping and level of fill techniques. 

Observe that, to operate with the series (|4.ip we need S = V^ l (I — Py )A It is 
this which requires setting an initial standard matrix subspace Vo- 

Example 4.1. For S = V^ l (I - P Vo )A we need to set an initial standard matrix 
subspace. The most inexpensive alternative is to take Vo to be the set of diagonal 
matrices. Then Vo = Pv a A is an immediately found. 

There are certainly other inexpensive alternatives for Vo, such as block diagonal 
matrices. Once fixed, thereafter the scheme can be given as Algorithm [T] below. 

Note that final step of Algorithm [T] is to keep the intersection of W and Vo empty 
apart from the diagonal; see Section 12.31 After the sparsity structure for a matrix 
subspace W has been generated, the sparsity structure of Vo can be updated to be V 
by using W. 



A = V (I - V -\I - P Vo )A) = MI - S). 



OO 




(4.1) 
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Algorithm 1 Sparsified powers for constructing W 

1: Set a truncation parameter k 

2: Compute Vq 

3: Compute S = V Q ~ 1 (I- P Vq )A 

4: Apply numerical dropping by tolerance and count to columns of S 

5: for columns j in parallel do 

6: Set Sj = tj = ej 
7: for I = 1, . . . , k do 
8: Compute tj = Stj 

9: Apply numerical dropping by tolerance and count to tj 

10: Compute Sj = Sj + tj 

11: end for 

12: Set sparsity structure of Wj to be the sparsity structure of Sj 

13: end for 

14: Set W = W\{V Q \1} 



4.2. Algebraic constructions. Next we consider some purely algebraic argu- 
ments which might be of use in constructing W. Again start with an initial standard 
matrix subspace Vq- Take the sparsity structure of the jth column of Vo and consider 
the corresponding rows of A E C nxn . Choose the sparsity structure of the jth column 
of W to be the union of the sparsity structures of these rows. This is a necessary (but 
not sufficient) condition for AW to have an intersection with Vo- This simply means 
choosing W to have the sparsity structure of A*Vo- 

Most notably, the process is very inexpensive and can be executed in parallel. 
One only needs to control that the columns of W remain sufficiently sparse. With 
probability one, the following algorithm yields the desired sparsity structure. 



Algorithm 2 Computing a sparsity structure for W 
Require: A sparse matrix A € c nx ™ ancl a random column Vj € Vo- 
Ensure: Sparsity structure of the column Wj. 
Compute w = A*Vj 
2: if w is not sparse enough then 

Sparsify w to have the sparsity structure of Wj. 
4: end if 

Take the sparsity structure of Wj to be the sparsity structure of w. 



Observe that we do not have A*Vq = W since the computation is concerned with 
sparsity structures. 

Approximate inverse preconditioning corresponds to choosing Vo to be the set of 
diagonal matrices. Then the sparsity structure of W equals that of A* . The following 
two examples illustrate two extremes cases of this choice. 

Example 4-2. Take Vo to be the set of diagonal matrices. Then the first basic 
algorithm reduces to the approximate inverse preconditioning. Algorithm [2] yields 
now a standard matrix subspace W whose sparsity structure equals that of A* . This 
can yield very good results. If A has orthogonal rows (equivalently, columns) then 
and only then this gives exactly a correct matrix subspace W for factoring the inverse 
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of A as AWV" 1 = / when V is taken to be V o 

Having identified an ideal structure for the approximate inverse preconditioning 
when W is constructed with Algorithm [21 how about when A is far from being a 
scaled unitary matrix? An upper (lower) triangular matrix is a scaled unitary matrix 
only when it reduces to a diagonal matrix. 

Example 4-3- Take again Vo to be the set of diagonal matrices. Then the basic 
algorithm reduces to the approximate inverse preconditioning. Algorithm [2] yields a 
standard matrix subspace W whose sparsity structure equals that of A*. This yields 
very poor results if A is an upper (lower) triangular matrix. Namely, then its inverse 
is also upper (lower) triangular. 

Algorithm [2] is set up in such a way that if Vo C Vo, then W C VV. Thereby 
matrix subspaces can be constructed to handle the two extremes of Examples 14.21 and 
14.31 simultaneously. 

In practice Vo should be more complex, i.e., the set of diagonal matrices is a too 
simple structure. One option is to start with Vo having the sparsity structure of the 
Gauss-Seidel preconditioner. 

Definition 4.1. A standard matrix subspace V of C nx " is said to have the spar- 
sity structure of the Gauss-Seidel preconditioner of A £ (£" x ™ {j n0 nzero entries 
in V appear on the diagonal and there where the strictly lower (upper) triangular part 
of A has nonzero entries. 

5. Numerical experiments. The purpose of this final section is to illustrate, 
with the help of four numerical experiments, how the preconditioners devised in Sec- 
tions [H and [3] perform in practice. Since there is an abundance of degrees of freedom 
to construct matrix subspaces for approximate factoring of the inverse, only a very 
incomplete set of experiments can be presented. In particular, we feel that there is a 
lot of room for new ideas and improvements. 

In choosing the benchmark sparse linear systems, we used the University of 
Florida collection [9|. The problems were selected to be the most challenging ones 
to precondition among those tested in [3]. For the matrices used and some of their 
properties, see Table I5TT1 Assuming the reader has an access to [3J, the comparison be- 
tween the methods proposed here and the diagonal Jacobi preconditioning, ILUT(O), 
ILUT(l), ILUT and AINV can be readily made. For a comparision between ILUs and 
AINV, see, e.g., [5J. 

Regarding preprocessing, in each experiment the original matrix has been initially 
permuted to have nonzero diagonal entries and scaled with MC64. (See [TU] for MC64.) 
It is desirable that the matrix subspace V contains hierarchically connected parts of 
the graph of the matrix. To this end we use an approach to find the strongly connected 
subgraphs of the matrix; see Duff and Kaya ill . We then obtain a permutation P 
such that after the permutations, the resulting linear system can be split as 

Ax = (L + D + U)x = b, (5.1) 

where L T and U are strictly block upper triangular and I? is a block diagonal matrix. 
The construction of this permutations consumes at most 0(n log (ro)) operations^ 

6 In view of this, it seems like a natural problem to ask, how well A can be approximated with 
matrices of the form DU with D diagonal and U unitary. 

Preprocessing is actually a part of the process of constructing the matrix subspaces W and V. 
That is, it is insignificant whether one orders correspondingly the entries of the matrix or the matrix 
subspaces. 
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Problem 


Area 


n 


nz(A) 


ki = xyl(A)/ti 


west 1505 


Chemical engineering 


1505 


5414 


3.6 


west2021 


Chemical engineering 


2021 


7310 


3.62 


lhr02 


Chemical engineering 


2954 


36875 


12.5 


bayerlO 


Chemical engineering 


13436 


71594 


5.33 


sherman2 


PDE 


1080 


23094 


21.4 


gematll 


Linear programming 


4929 


33108 


6.72 


gematl2 


Linear programming 


4929 


33044 


6.7 


utm5940 


PDE 


5940 


83842 


14.1 


e20rl000 


PDE 


4241 


131430 


31 



Table 5.1 

Matrices of the experiments, their application area, size, number of nonzeros and density. 



In the experiments, the right-hand side b G C" in (|5.1|) was chosen in such a way 
that the solution of the original linear system was always x — (1,1, ... ,1). As in [3] , 
as a linear solver we used BiCGSTAB [17]. The iteration was considered converged 
when the initial residual had been reduced by eight orders of magnitude. 

The numerical experiments were carried out with Matlatd. 



Example 5.1. We compare the minimization algorithm presented [4] (PAIF) with 
the QR factorization based minimization algorithm of Section 12.11 (DIAF-Q). We 
construct W with the heuristic Algorithm [1] of Section I4TT1 For all test matrices, we 
use k = 3 and Tj = IE — 1, pi = 0, t; = and p\ — 0, as parameters. For PAIF, 80 
refinement iterations were always used which is a somewhat more than what we have 
found to be necessary in practice. However, we want to be sure that the comparison 
is descriptive in terms of the quality of the preconditioner. 

We choose V to be the subspace of block diagonal matrices with block bounds and 
sparsity structure chosen according to the block diagonal part of A, i.e., the matrix 
D in (|5.1I) . Then in the heuristic construction of W with Algorithm [1] Vb is taken to 
be a diagonal matrix. 

We denote by |-Dj |m the maximum blocksize of V and by #Dj the number of 
blocks in V in total. Density of the preconditioner, denoted by p, is computed as 
p = (nz(W) + nz(Ly) + nz(LV))/nz(^4), where nz(^4), nz(W), nz(L>) and nz(LV) 
denote the number of nonzeroes in A, W and the LU decomposition of V. For both 
PAIF and DIAF-Q, we also compute the condition number estimate k(V) and norm 
of the minimizer H^PF — ^||f, denoted by nrm. Finally, its denotes the number of 
BiCGSTAB iterations. By f we denote if no convergence of BiCGSTAB within 1000 
iterations. Breakdown of BiCGSTAB is denoted by |. Table [5721 shows the results. 
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Problem 


\ d j\m 


#D 3 


P 


PAIF 

k(V) 


nrm 


its 


DIAF-Q 

k(V) 


nrm 


its 


westl505 


50 


34 


2.75 


3.17E+04 


3.59 


18 


1.85E+03 


3.49 


18 


west2021 


50 


47 


2.69 


5.53E+03 


3.84 


23 


3.33E+03 


3.53 


26 


lhr02 


50 


66 


1.11 


1.65E+03 


6.69 


24 


9.05E+02 


7.01 


32 


bayerlO 


250 


67 


2.56 


8.00E+05 


22.27 


56 


2.50E+05 


14.27 


36 


sherman2 


50 


24 


1.05 


4.32E+02 


2.45 


5 


3.77E+02 


1.84 


5 


gematll 


50 


115 


1.91 


2.28E+05 


3.58 


109 


1.50E+05 


2.79 


68 


gematl2 


50 


114 


1.91 


5.96E+06 


6.87 


77 


4.58E+06 


5.20 


77 


utm5940 


250 


29 


1.73 


3.91E+06 


14.66 


295 


1.84E+06 


12.86 


221 


e20rl000 


200 


27 


4.23 


3.22E+06 


13.67 


465 


4.44E+03 


8.82 


364 



Table 5.2 

Comparison of PAIF and DIAF-Q algorithms 



Results very similar to those seen in Table 15.21 were also observed in other nu- 
merical tests that were conducted. As a general remark, the iteration counts with 
BiCGSTAB when preconditioned with DIAF-Q are not dramatically different from 
those achieved with PAIF. The main benefits of DIAF-Q are that neither the Her- 
mitian transpose of A is required in the computations nor an estimate for the norm 
of A. Moreover, DIAF-Q is a direct method, so that its computational cost is easily 
estimated, while it is not so clear when to stop the iterations with PAIF. 

The computational cost and parallel implementation of DIAF-Q is very similar 
to the established preconditioning techniques based on norm minimization for sparse 
approximate inverse. (For these issues, see [7]-) That is, DIAF-Q scales essentially 
accordingly in terms of the computational cost and parallelizability properties. 

Example 5.2. Next we compare PAIF with the SVD based algorithm of Section 
I2T21 (DIAF-S). Again W is constructed with the heuristic Algorithm Q] of Section |4T1 
All the parameters were kept the same as in the previous example, i.e., k = 3 and 
t, = IE — 1, pi = 0, 77 = and pi = 0. Also, 80 refinement steps were again used 
in the power method, so that the results for PAIF are identical to those presented in 
Example 15.11 

Table l5~3l shows the results. 



Problem 


Dj M 


#Dj 


P 


PAIF 

k(V) 


nrm 


its 


DIAF-S 

k(V) 


nrm 


its 


westl505 


50 


34 


2.75 


3.17E+04 


3.59 


18 


4.06E+03 


3.04 


14 


west2021 


50 


47 


2.69 


5.53E+03 


3.84 


23 


8.31E+03 


3.26 


27 


lhr02 


50 


66 


1.11 


1.65E+03 


6.69 


24 


1.90E+03 


5.68 


55 


bayerlO 


250 


67 


2.56 


8.00E+05 


22.27 


56 


9.50E+05 


11.68 


46 


sherman2 


50 


24 


1.05 


4.32E+02 


2.45 


5 


4.33E+02 


1.67 


5 


gematll 


50 


115 


1.91 


2.28E+05 


3.58 


109 


2.28E+05 


2.90 


113 


gematl2 


50 


114 


1.91 


5.96E+06 


6.87 


77 


1.51E+08 


3.72 


201 


utm5940 


250 


29 


1.73 


3.91E+06 


14.66 


295 


3.91E+06 


7.43 


t 


e20rl000 


200 


27 


4.23 


3.22E+06 


13.67 


465 


1.96E+04 


10.37 


444 



Table 5.3 

Comparison of PAIF and DIAF-S algorithms 



The results of Table IOI with DIAF-S are very similar to those in Table [Ol The 
only notable exception is the matrix utm5940, for which no convergence was achieved 
with DIAF-S. With the metrics used, we do not quite understand why DIAF-S fails 
to produce a good preconditioner for this particular problem. The computed norm 
IjAVF— V\\f is smaller than the one attained with DIAF-Q and the condition number 
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estimate is only slightly worse. The reason is most likely related with the fact that 
the bound (|3.1I) cannot be expected to be tight enough when k(V) is large. 

The following example illustrates how the matrix subspace V can be optimally 
constructed with the techniques of Section [3] 

Example 5.3. In this example we consider an optimal construction of V. To 
this end, we first construct W with the heuristic Algorithm [T] presented in Section 
14.11 Then, to construct V, we apply the techniques presented in Section [3] After 
the sparsity structures of the subspaces have been fixed, the resulting minimization 
problem is solved with DIAF-Q. 

Consider the minimization problem (|2.4j) . If no restrictions on the number of 
nonzero entries in a matrix subspace V are imposed, the norm HAT4 7 — V\\f can be 
decreased by choosing as many entries as possible from the sparsity structure of AW 
to be in the sparsity structure of To illustrate this, we take V to be a subspace 
of block digonal matrices by allowing only certain degree of sparsity ky per column. 
The nonzero entries arc chosen with the techniques of Section [3] 

We again set k = 3 and Tj = IE — 1, pi = 0, tj = and pi — 0, as parameters for 
all test matrices. To have the locations for the entries in the diagonal blocks of V, we 
then apply the method presented in Section [3l Subspace W is constructed with the 
heuristic Algorithm [T] by setting Vo to be a subspace of block diagonal matrices with 
full blocks. This is to ensure that intersection of the final V and W is empty. 

Table 15.31 shows the results, where at most fey entries in each column of the 
sparsity pattern of V have been allowed. For each test problem we have used the 
same block structure as in Examples 15.11 and 15.21 only the locations of the nonzero 
entries in W and V is varied. 
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k v = 


30 






fev = 


50 






Problem 


P 


k(V) 


nrm 


its 


P 


K(V) 


nrm 


its 


P 


k(V) 


nrm 


its 


wcstl505 


2.81 


2.24E+03 


3.54 


20 


2.83 


4.57E+03 


3.36 


20 


2.83 


4.57E+03 


3.36 


20 


wcst2021 


2.75 


5.11E+04 


3.70 


30 


2.76 


5.67E+04 


3.46 


31 


2.76 


5.67E+04 


3.46 


31 


lhr02 


1.15 


1.07E+04 


6.95 


47 


1.17 


1.27E+04 


6.92 


43 


1.17 


1.27E+04 


6.92 


43 


baycrlO 


2.68 


6.72E+07 


13.83 


104 


2.75 


1.86E+05 


13.42 


29 


2.76 


1.86E+05 


13.42 


31 


sherman2 


1.01 


1.49E+02 


1.88 


7 


1.11 


3.77E+02 


1.82 


5 


1.11 


3.77E+02 


1.82 


5 


gcmatll 


2.14 


1.50E+05 


2.84 


81 


2.24 


1.50E+05 


2.75 


68 


2.24 


1.50E+05 


2.75 


69 


gematl2 


2.09 


4.24E+06 


5.21 


80 


2.17 


4.58E+06 


5.13 


71 


2.17 


4.58E+06 


5.13 


70 


utm5940 


2.11 


1.82E+06 


13.12 


i 


2.49 


2.09E+06 


12.72 


209 


2.51 


2.11E+06 


12.71 


201 


c20rl000 


3.77 


2.16E+06 


20.66 


i 


4.99 


1.01E+05 


11.76 


t 


5.49 


6.67E+04 


8.77 


418 



Table 5.4 

Adaptive selection of V for different values of fcv 



As seen in Table [ST4l choosing more entries in V, i.e., increasing k\> always im- 
proves the norm of the minimizer, which is well supported by the theory. Allowing 
more entries in V generally produces a better preconditioner. In a few cases where a 
slightly worse convergence can be observed, we also observe a slightly worse condition 
number estimate for the computed V. 

The final example illustrates the optimal selection of a block upper triangular 
subspace V as well as optimization under additional constraints. 

Example 5.4- We consider optimal construction V in the case where V is block 
upper triangular. As parameters we again use k = 3 and Tj = IE — 1, pi = 0, 



9 For example, V can never be the full set set of upper triangular matrices since it would require 
storing 0(ra 2 ) complex numbers. The problem is then, how to choose a subspace V of upper triangular 
matrices. 
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Ti = and pi = and use the strongly connected subgraph approach to have a block 
structure for the subspace V. 

To have locations for the entries in the block upper triangular V, we apply the 
method presented in Section [3l As in Example l5.3[ we construct W with Algorithm [T] 
by setting Vo as a subspace of block upper triangular matrices with full blocks. The 
resulting W is a lower triangular matrix subspace consisting of a diagonal part and 
a strictly block lower triangular part. The resulting minimization problem is solved 
with DIAF-Q. Note that by the structure of such a subspace, the conditioning of 
W 6 W can be readily verified. 

Table f5T3l shows the results, where at most fcy entries in each column in the block 
upper triangular part of Vo have been allowed. The used block structure is the same 
as in Examples 15.11 15.21 and 15.31 only the locations and the number of the nonzero 
entries is varied in W and V. 





fcy — 


10 






fey = 


30 






fey = 


100 






Problem 


P 


K(V) 


nrm 


its 


P 


«(V) 


nrm 


its 


P 


n(V) 


nrm 


its 


wcstl505 


2.37 


4.45E+06 


4.79 


860 


2.68 


1.88E+06 


2.52 


78 


2.75 


6.95E+04 


2.42 


24 


wcst2021 


2.31 


1.66E+11 


5.34 


t 


2.57 


8.30E+05 


2.53 


67 


2.66 


3.25E+05 


2.27 


23 


lhr02 


1.03 


2.45E+03 


4.22 


36 


1.40 


4.89E+03 


3.21 


22 


1.46 


4.93E+03 


3.18 


21 


bayerlO 


2.09 


3.74E+09 


11.76 


963 


2.42 


2.90E+06 


8.19 


390 


2.49 


3.39E+06 


8.10 


16 


shcrman2 


0.89 


1.61E+02 


0.89 


6 


1.24 


4.66E+02 


0.63 


3 


1.24 


4.66E+02 


0.63 


3 


gcmatll 


1.84 


1.50E+05 


2.37 


55 


1.96 


1.60E+05 


1.74 


32 


1.97 


1.60E+05 


1.74 


32 


gematl2 


1.82 


1.45E+09 


3.37 


160 


1.95 


1.99E+09 


2.38 


39 


1.96 


2.03E+09 


2.37 


38 


utm5940 


1.60 


5.29E+07 


9.16 


653 


2.07 


1.12E+09 


8.17 


141 


2.18 


8.56E+08 


8.14 


165 


e20rl000 


1.90 


2.00E+11 


22.32 


t 


2.77 


6.59E+10 


12.48 


t 


3.85 


2.52E+09 


6.05 


792 



Table 5.5 

Adaptive selection of V for different values of ky 



The results of Table 15.51 are very similar to those of Table 15.41 Again, allowing 
more entries in V always improves the norm of the minimizer and usually also produces 
a better preconditioner. 

Similarly as in Example l5.21 it is again hard to understand why fcy = 100 produces 
a worse preconditioner than fcy = 30 for utm5940. We attribute this behaviour to the 
looseness of the bound (|3.1j) . i.e., when k(V) is large, the minimization of | — V\ \f 
may not compensate sufficiently for this in these cases. 

To improve the conditioning of V G V, we now consider the same problems using 
the technique of imposing constraints as described in Section [3] As a constraint we 
require that for the diagonal entries Vjj of V it holds \vjj\ > le — 2. In case the 
requirement is not met, we impose a constraint with r = 2. Table [5761 describes the 
results for fcy = 100, where the number of constrained columns is denoted by stab. 





fey = 


30 








Problem 


P 


k(V) 


nrm 


stab 


its 


wcstl505 


2.75 


6.16E+04 


3.06 


1 


25 


wcst2021 


2.66 


1.63E+05 


2.92 


1 


21 


lhr02 


1.46 


4.93E+03 


3.18 





21 


bayerlO 


2.49 


3.39E+06 


8.10 





16 


shcrman2 


1.24 


4.66E+02 


0.63 





3 


gematll 


1.97 


1.60E+05 


1.74 





32 


gematl2 


1.96 


2.03E+09 


2.37 





38 


utm5940 


2.18 


7.56E+06 


8.24 


1 


113 


c20rl000 


3.85 


8.66E+09 


6.27 


3 


738 



Table 5.6 
Constrained selection of V for fcy = 100 



As seen in Table 15.61 if only a small number columns has to be constrained, 
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the technique is be effective. In other numerical experiments not reported here we 
observed that if too many columns have to be constrained, the norm of the minimizer 
\\AW — V\\f tends to increase. An approach to find a right balance is needed then. 

To sum up these experiments, the iteration counts obtained with DIAF-Q and 
DIAF-S (which are fully parallelizable) seem to be competitive with the iteration 
counts obtained with the standard algebraic (sequential) preconditioning techniques. 
Moreover, a good problem specific tuning of matrix subspaces possesses a lot of po- 
tential for significantly speeding up the iterations. 
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